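Image processing is at the core of Gemma 3's multimodal capabilities. The model can interpret the content, objects, and context of an image, and respond to text instructions grounded in that visual information. This lets Gemma 3 perform well on many vision-related tasks.
Core image-processing capabilities of Gemma 3
Image understanding and description (image captioning):
The model takes an image and generates a detailed text description of the objects, people, and places it contains, along with the relationships between them and any actions taking place.
Example: given a photo of someone walking a dog in a park, Gemma 3 might describe it as: "A woman is walking a dog on a leash in a park, with green trees and grass in the background."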
# Install Transformers
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
# Import libraries and dependencies
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import cv2
from IPython.display import display, Markdown, HTML
from base64 import b64encode
import requests
import torch
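# Choose the Gemma 3 model variant.
from google.colab import userdata
import os
# Assumption: the Hugging Face access token is stored as a Colab secret named "HF_TOKEN".
hf_token = userdata.get("HF_TOKEN")
# Note: the 1B variant is text-only; choose 4B or larger for image input.
model_name = 'gemma-3-4b-it' # @param ['gemma-3-1b-it', 'gemma-3-4b-it', 'gemma-3-12b-it', 'gemma-3-27b-it']
model_id = f"google/{model_name}"
# Load the model in bfloat16 and let Accelerate place it on the available device(s).
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, token=hf_token
).eval()
processor = AutoProcessor.from_pretrained(model_id, token=hf_token)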
# Define helper functions
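def resize_image(image_path, target_width=640, target_height=640):
    """Open an image and shrink it to fit within the target size, keeping its aspect ratio."""
    img = Image.open(image_path)
    # Calculate the target size (maximum width and height).
    if target_width and target_height:
        max_size = (target_width, target_height)
    elif target_width:
        max_size = (target_width, img.height)
    elif target_height:
        max_size = (img.width, target_height)
    else:
        # No size limit given; return the image unchanged.
        return img
    # thumbnail() resizes in place and never enlarges the image.
    img.thumbnail(max_size)
    return img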
def get_model_response(img: Image.Image, prompt: str, model, processor):
    """Send one image plus a text prompt to the model and return the decoded reply."""
    # Prepare the messages for the model.
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant. Reply only with the answer to the question asked, and avoid using additional text in your response like 'here's the answer'."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": prompt}
            ]
        }
    ]
    # Tokenize inputs and prepare for the model.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)
    input_len = inputs["input_ids"].shape[-1]
    # Generate response from the model (greedy decoding, at most 100 new tokens).
    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        # Keep only the newly generated tokens, dropping the prompt.
        generation = generation[0][input_len:]
    # Decode the response.
    response = processor.decode(generation, skip_special_tokens=True)
    return response
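The `resize_image` helper relies on `Image.thumbnail`, which shrinks an image to fit inside a bounding box while keeping its aspect ratio and never enlarges it. The sketch below approximates the resulting size with plain arithmetic so the effect of the 640 x 640 limit is easier to see. `fit_within` is a hypothetical helper written for illustration, not part of Pillow, and Pillow's own rounding may differ by a pixel in edge cases.

```python
def fit_within(width: int, height: int,
               max_width: int = 640, max_height: int = 640) -> tuple[int, int]:
    """Approximate the size an image has after thumbnail((max_width, max_height)).

    Hypothetical helper for illustration only; Pillow's rounding may differ slightly.
    """
    # Scale factor that makes both sides fit inside the box.
    # It is capped at 1.0 because thumbnail() never enlarges an image.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1280, 720))  # (640, 360)
print(fit_within(320, 200))   # (320, 200)
```

A 1280 x 720 frame is halved to 640 x 360, while an image already smaller than the box is left at its original size.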
image_file = 'image_5.jpg' # @param {type: 'string'}
prompt = "Describe the image." # @param {type: 'string'}
img = resize_image(image_file)
display(img)
response = get_model_response(img, prompt, model, processor)
display(Markdown(response))
DevFest Taipei 2025, December 6th
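When asked to describe an image, Gemma 3 can summarize its content, and the results are quite good.
You can use it to:
Generate content summaries automatically: quickly produce descriptions for blog posts or social media images.
Assist people with visual impairments: provide richer, more specific image descriptions that help them understand the scene.
Classify and search images: automatically generate keywords or tags from image content to make later retrieval easier.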
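As a rough sketch of the keyword and tag use case, the model can be prompted for a comma-separated list and the reply cleaned up before indexing. `TAG_PROMPT` and `parse_tags` are hypothetical names introduced here for illustration, and the sketch assumes the model actually follows the requested comma-separated format, which is not guaranteed.

```python
# Hypothetical prompt; the exact wording is an assumption, not taken from the notebook.
TAG_PROMPT = "List up to 10 keywords for this image as a comma-separated list."


def parse_tags(response: str, max_tags: int = 10) -> list[str]:
    """Split a comma-separated model reply into clean, lowercase, de-duplicated tags."""
    tags: list[str] = []
    seen: set[str] = set()
    # Treat line breaks like commas in case the model puts each tag on its own line.
    for raw in response.replace("\n", ",").split(","):
        tag = raw.strip().strip(".").lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags[:max_tags]


# In the notebook, the reply would come from the existing helper, for example:
#   tags = parse_tags(get_model_response(img, TAG_PROMPT, model, processor))
print(parse_tags("Dog, park, woman, Dog, grass."))  # ['dog', 'park', 'woman', 'grass']
```

Normalizing case and removing duplicates before storing the tags keeps a search index from treating "Dog" and "dog" as different keywords.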